The skillIQ and roleIQ tests are addictive. I haven’t used Pluralsight to learn and improve my technical skills yet, but I can see how the assessments would drive engagement and keep subscribers coming back to improve. What a fun way to encourage personal and professional development!
Using the user_assessment_sessions dataset, we can see the distribution of the various metrics for the 6678 user sessions.
A comparison of the assessments is shown below. The consistency initially surprised me, but it makes sense that the distribution across the assessments varies little given the need to provide a standardized evaluation process agnostic to the actual assessment.
A look at the distributions of the user_interactions dataset again shows some variability between assessments (which might matter if I knew more about the methodology), but no more than expected.
The distributions of rd and client_elapsed_time are heavily skewed, so I took the log (base 10) of client_elapsed_time and dropped rd altogether. Later (question 5) we see that the question-based rd value is 30 for the majority of the question interactions. I also removed records with outlying client_elapsed_time values: those above the 99th percentile (147260.3) and those below 0.
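The filtering and transformation described above can be sketched as follows. This is a minimal pandas illustration (the original analysis was done in R), using a toy data frame and assuming a column named `client_elapsed_time`; the actual dataset schema may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the user_interactions data; the real dataset is larger
# and has more columns.
interactions = pd.DataFrame({
    "client_elapsed_time": [1200.0, 5400.0, -3.0, 200000.0, 850.0, 147000.0],
})

# Drop negative times and anything above the 99th percentile, then work
# with log10(client_elapsed_time) to tame the heavy right skew.
cutoff = interactions["client_elapsed_time"].quantile(0.99)
clean = interactions[
    (interactions["client_elapsed_time"] >= 0)
    & (interactions["client_elapsed_time"] <= cutoff)
].copy()
clean["log_elapsed"] = np.log10(clean["client_elapsed_time"])
```

Note that any exact-zero elapsed times would need separate handling before taking the log.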
Per the plot, the distributions of client_elapsed_time for the assessments are very similar, though React is shifted slightly to the right of the other three. That shift may be meaningful given the log scale. The distributions of the ranking metric vary in a noisy way but still follow the same general structure.
We can evaluate how the algorithm decides to stop asking questions using a time series of each assessment. The obvious guess is a minimal threshold on the question-to-question changes in the rd value. Something very similar to that guess is confirmed by observing a random sample of several user_assessment_session_ids.
It’s probably worth checking the other metrics associated with a session (display_score, percentile, and ranking) to confirm our suspicion that rd is the main variable driving the algorithm. Per the plots below of the same three assessment sessions, we see that rd is the only metric of the four that seems a plausible candidate.
A closer look at the distribution of the minimum rd values of each assessment’s interactions shows that a simple threshold of 80 drives the stopping rule. Over 75% of the sessions stopped at an rd value below and very near 80. While 80 seems like an arbitrary value to me, some empirical and theoretical work was likely done to determine that threshold. Also, 75% may seem low, but that figure includes all sessions, even those stopped prematurely by the user (as discussed in question 3).
Assuming the threshold for completing the assessment is 80, the overall dropout rate is around 25% (1608/6678 = 24.1%). The dropout rates for each assessment vary substantially: React at 34.9%, Illustrator at 32.1%, Python at 21.8%, and Javascript at 21.0%.
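The dropout calculation can be sketched like this: treat a session as complete if its minimum rd reached the assumed threshold of 80, then average per assessment. This is a pandas sketch with toy data and assumed column names (`user_assessment_session_id`, `assessment`, `rd`), not the original R code.

```python
import pandas as pd

# Toy stand-in for the user_interactions table.
interactions = pd.DataFrame({
    "user_assessment_session_id": [1, 1, 2, 2, 3, 3],
    "assessment": ["React", "React", "Python", "Python", "React", "React"],
    "rd": [120.0, 75.0, 110.0, 95.0, 130.0, 78.0],
})

# A session "completes" if its minimum rd drops to the assumed threshold of 80.
session_min = (
    interactions.groupby(["assessment", "user_assessment_session_id"])["rd"]
    .min()
    .reset_index()
)
session_min["dropped_out"] = session_min["rd"] > 80

# Per-assessment dropout rate: fraction of sessions that never hit the threshold.
dropout_rate = session_min.groupby("assessment")["dropped_out"].mean()
```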
While the plot below doesn’t quite tell the full story, it does help illustrate that it generally takes 18-20 answered questions to reach the rd threshold of 80. There is a subtle negative slope to the blue points, indicating that users who dropped out early (< 10 questions) were likely answering the majority of their few questions incorrectly. A closer look using an approach similar to what I did in question 5 could fully test that hypothesis.
Javascript has an unusually high number of questions answered at the highest end of display_score that still don’t quite drive the rd value below the threshold. That may be worth examining in more detail.
To measure question difficulty I chose to calculate how often (percent) a user answered a given question correctly. The density plots below show the range of question difficulty for each Assessment:Topic combination. Clearly the range of question difficulty varies greatly across the Assessment:Topic combinations. Some topics, like Python: Scalars and Operators and Illustrator: Transforming and Managing Objects, span nearly the entire range of values showing both easy (questions frequently answered correctly) and difficult questions. Other topics, like React: Forms and Javascript: Exceptions, have questions that are generally answered correctly the same percent of the time. For such topics this lack of variability may make it hard to differentiate scores and rankings when compared with topics containing more variety in the difficulty of questions.
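The difficulty metric described above can be sketched in a few lines. This is a pandas illustration with toy data (the original analysis was in R); the column names `assessment_item_id` and `correct` are assumptions about the schema.

```python
import pandas as pd

# Toy answer log: one row per question interaction, with `correct` as 1/0.
answers = pd.DataFrame({
    "assessment_item_id": ["q1", "q1", "q1", "q2", "q2"],
    "correct": [1, 0, 1, 0, 0],
})

# Difficulty proxy: percent of interactions answering each question correctly.
# Higher values mean easier questions.
difficulty = answers.groupby("assessment_item_id")["correct"].mean() * 100
```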
The range (max - min) of question difficulty (percent correctly answered) for each Assessment:Topic combination is listed below. I didn’t check the frequency of the questions or of the topics which would affect both the percent correctly answered and possibly the range of difficulty.
| assessment | topic_name | range |
|---|---|---|
| Illustrator | Color | 57.1 |
| Illustrator | Document Setup | 100.0 |
| Illustrator | Drawing and Painting | 100.0 |
| Illustrator | Object Effects & Blending | 100.0 |
| Illustrator | Preparing for Print | 100.0 |
| Illustrator | Selection Tools | 75.0 |
| Illustrator | Transforming and Managing Objects | 85.7 |
| Illustrator | Typography | 100.0 |
| Illustrator | Working with Placed Graphics | 44.2 |
| Javascript | Arrays | 75.0 |
| Javascript | Basics | 100.0 |
| Javascript | Exceptions | 21.6 |
| Javascript | Functions | 41.7 |
| Javascript | Object Oriented JavaScript | 52.3 |
| Javascript | Objects | 56.0 |
| Javascript | Operators | 58.1 |
| Javascript | Statements | 2.8 |
| Javascript | Types | 100.0 |
| Unknown | NA | 100.0 |
| Python | Collections | 100.0 |
| Python | Correctness | 69.0 |
| Python | Development Environment | 76.9 |
| Python | Functions | 100.0 |
| Python | Modules | 100.0 |
| Python | Objects | 66.7 |
| Python | Scalars & Operators | 80.0 |
| Python | Strings & IO | 100.0 |
| Python | Syntax | 100.0 |
| React | Components | 68.6 |
| React | Events and Binding | 33.5 |
| React | Forms | 8.5 |
| React | JSX | 30.1 |
| React | Lifecycle | 9.4 |
| React | Performance | 39.7 |
| React | Props | 18.1 |
| React | State | 22.7 |
| React | Styling | 9.7 |
| React | Testing | 25.2 |
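The per-topic range computation behind the table can be sketched as follows, again as a pandas illustration with toy data. The `assessment`, `topic_name`, and `pct_correct` columns are assumed names; `pct_correct` would come from the per-question difficulty calculation described earlier.

```python
import pandas as pd

# Toy question-level difficulties (percent correctly answered).
difficulty = pd.DataFrame({
    "assessment": ["React", "React", "React", "Python", "Python"],
    "topic_name": ["Forms", "Forms", "Forms", "Modules", "Modules"],
    "pct_correct": [45.0, 50.0, 53.5, 0.0, 100.0],
})

# Range (max - min) of question difficulty within each Assessment:Topic pair.
difficulty_range = (
    difficulty.groupby(["assessment", "topic_name"])["pct_correct"]
    .agg(lambda s: s.max() - s.min())
)
```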
There are 724 questions in the dataset. I expect the rd metric to again indicate a certainty floor. A quick look at the distribution of rd values shows that floor to be 30. However, many (71.1%) of the assessment_item_ids have all of their rd values equal to 30. Perhaps those are older questions that reached the floor of 30 before the period covered by this dataset.
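The "all rd values at the floor" check can be sketched like this, as a pandas illustration with toy data; the `assessment_item_id` and `rd` column names are assumptions about the schema.

```python
import pandas as pd

# Toy rd history: several interactions per question.
rd = pd.DataFrame({
    "assessment_item_id": ["q1", "q1", "q2", "q2", "q3"],
    "rd": [30.0, 30.0, 30.0, 45.0, 30.0],
})

# Flag items whose rd never leaves the floor of 30, then take the
# percentage of items stuck at the floor.
all_floor = rd.groupby("assessment_item_id")["rd"].agg(
    lambda s: (s == 30.0).all()
)
pct_all_floor = all_floor.mean() * 100
```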
We would really like to look at all 724 of these questions. We can examine much of the structure using trelliscopejs, a tool for interactively viewing a large collection of visualizations. The key advantage of trelliscope is that it allows creation of a rich feature set that can then be used to sort and filter the data, helping us see nuances, outliers, and important features.
A brief description of the cognostics (features) is available by clicking the “i” in the upper left corner. You can search for interesting assessment_item_ids using the Sort and Filter buttons on the left-hand side. To see the assessment_item_ids with rd values other than 30, click the Filter button, then the “All RD values = 30” pill, and enter “0” on the right side. This reduces the total number of panels from 724 to 209. To see only panels (plots) where at least two points are present (and thus a plot is drawn), stay in the Filter menu, click the “Number of Question Interactions” pill, and enter 2 on the left-hand side of the range selection. This removes all the blank panels (not plotted because only one observation exists) and reduces the number of panels from 209 to 180. Clicking the Filter button again closes that window. You can sort or filter further to test hypotheses or explore the data sliced by assessment_item_id. Happy exploring!
Also note that the panel plotting function can be ggplot- or rbokeh-based. Here I used rbokeh, so some interactivity exists within each plot as well.